arXiv:1504.06706v2 [math.ST] 28 Apr 2015 


Penalized Likelihood Estimation in High-Dimensional 
Time Series Models and Its Application 

Yoshimasa Uematsu* 

Institute of Statistical Mathematics 
April 29, 2015 


This paper presents a general theoretical framework of penalized quasi-maximum 
likelihood (PQML) estimation in stationary multiple time series models when the num¬ 
ber of parameters possibly diverges. We show the oracle property of the PQML estima¬ 
tor under high-level, but tractable, assumptions, comprising the first half of the paper. 
Utilizing these results, we propose in the latter half of the paper a method of sparse esti¬ 
mation in high-dimensional vector autoregressive (VAR) models. Finally, the usability 
of the sparse high-dimensional VAR model is confirmed wifh a simulafion sfudy and an 
empirical analysis on a yield curve forecasl. 

Keywords: Penalized likelihood, diverging parameters, stationary process, vector autore¬ 
gression, yield curve forecasting. 

JEL classification: C13, C32. C55, C58 


*Address correspondence to Yoshimasa Uematsu, The Institute of Statistical Mathematics, 10-3 Midori- 
cho, Tachikawa, Tokyo 190-8562, Japan. E-mail: uematsu@ism.ac.jp. 


1 



1 Introduction 


Many statistical models have been developed to capture the relationships between variables 
within a multiple time series, as well as their individual marginal behaviors. These types of 
models help researchers identify the probabilistic structure of the interdependent relation¬ 
ship between variables along the axis of time. Typical examples are vector autoregressive 
(VAR) models and multivariate generalized autoregressive conditional heteroskedasticity 
(MGARCH) models. Researchers may frequently be required to estimate such models un¬ 
der a high-dimensional setting. However, it is quite difficult to accurately estimate all the 
coefficients at the same time without some parameter restrictions, even when the dimension 
is not so high. 

Attempts to address the challenges presented by high dimensionality have been ad¬ 
dressed in many works. One solution is to utilize a factor structure of multiple time se¬ 
ries. The most famous result may be the approximate factor models that stem from classical 
factor analysis; see the review by Bai and Ng (2008) and references therein. Another ex¬ 
ample that uses a factor structure is a variant of the dynamic Nelson-Siegel model, which 
attempts to parsimoniously model a term structure of interest rates; see the comprehensive 
book by Diebold and Rudebusch (2013). These factor-based approaches have gained great 
success in high-dimensional time series econometrics, specifically in forecasting. However, 
there are limits to what can be done by the factor models alone, and an alternative method, 
such as the direct use of a (large dimensional) VAR model, undoubtedly needs to be estab¬ 
lished. Accordingly, the present paper provides development in that direction. A penalized 
likelihood method is proposed for dimension reduction that is applicable to a variety of 
high-dimensional multivariate time series models. 

Looking at the history of likelihood-based approaches, methods of model selection may 
have originally been informed by the use of information criteria, like the AIC and BIC. 
They have become very popular due to their tractability, but these methods are limited when 
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dealing with high-dimensional models since they demand an exhaustive search over all sub¬ 
models. This difficulty led to a general penalized likelihood method that brought about 
simultaneous estimation and model selection. One of the most popular methods may be 
the Li-penalization investigated by Tibshirani (1996). His approach was originally intro¬ 
duced in the context of penalized least squares and is well-known as the Lasso. For more 
information, see Fan et al. (2011). 

After that, desirability of penalties increased, and their applicability to higher-dimensional 
models have been pursued. Fan and Li (2001) explored a statistically desirable concept 
called the oracle property, in which the estimator is asymptotically equivalent to the maxi¬ 
mum likelihood estimator of a correct submodel. They also advocated the smoothly clipped 
absolute deviation (SCAD) penalty that gives an estimator with the oracle property, but they 
pointed out that the Li-penalty might not produce the oracle property in general. Their re¬ 
sults have been extended by many authors. In terms of the formulation of penalties, Lv and 
Fan (2009) proposed a wide class of folded-concave penalties, including the Li-penalty and 
SCAD. Fan and Peng (2004) improved the SCAD-penalized likelihood to admit a dimen¬ 
sionality growing with the relatively slow rate o(n^/^) or where n is the sample 

size. Likewise, a faster rate has been examined by several authors as well. Kim et al. (2008) 
considered the case where the dimension is possibly larger than the sample size, growing at 
a polynomial rate, and Fan and Lv (2011) studied ultrahigh-dimensionality that grows non- 
polynomially fast. Such penalized estimation methods are definitely useful when managing 
high-dimensionality in statistics and econometrics studies. Nevertheless, all the results have 
been derived under i.i.d. assumptions and have not been extended to a general time series 
framework, to the best of our knowledge. 

Responding to this context, the first half of this paper proposes a general framework of 
penalized quasi-maximum likelihood (PQML) estimation that can be applicable to many 
kinds of stationary multivariate time series models. As in the preceding research in the 
i.i.d. context, the number of parameters, p, is assumed to diverge with a rate proportional 
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to a polynomial of sample size T. Thus, p is possibly larger than T, as long as the pa¬ 
rameter vector is sparse. PQML finds an estimator that, by definition, maximizes a quasi- 
log-likelihood function with a penalty characterized by the broad class of folded-concave 
penalties introduced by Lv and Fan (2009). Hence, we can include many types of penalties 
other than the Lasso-type one. As a main theoretical contribution, we show the existence of 
this estimator and its oracle property under general high-level, but tractable, assumptions. 
The derivation of the results is based on Fan and Lv (2011), but the extension is not trivial 
since our setting is not limited to i.i.d. cases. Furthermore, contrary to the works intro¬ 
duced above, this paper employs a quasi-likelihood, which makes possible a wide range of 
applications—particularly in the analysis of VAR models. 

Moving beyond general time series models to more specific ones, there are several ar¬ 
ticles that apply the penalized estimation method to obtain sparse estimates. Wang et al. 
(2007) adapted the Lasso to shrink the coefficients in a univariate regression model with 
autoregressive errors while assuming a fixed lag order. This work was extended by Nardi 
and Rinaldo (2011) to let the maximal lag grow. Recently, Medeiros and Mendes (2012) 
have further extended the result to permit more general processes, but still the method only 
applies to univariate models. Estimation of cointegrating regressions with a Lasso-type 
penalty has been studied by Mendes (2011)—considering a scalar-valued model with di¬ 
verging parameters—and Liao and Phillips (2012)—investigating cointegrating rank and lag 
order selection of a vector error correction model. Again, these studies are also restricted 
to fixed dimensional models. Lasso-type estimation of VAR models has been studied by 
several authors, including Song and Bickel (2011), Audrino and Camponovo (2013), Basu 
and Michailidis (2013), and Kock and Callot (2014). Kock and Callot (2014), in particular, 
were theoretically successful in deriving oracle inequalities of estimated VAR coefficients 
with Lasso-type penalties. The latter half of the present paper relates to these works. 

After establishing our framework, we demonstrate sparse estimation in a high-dimensional 
VAR model as a theoretical example of the results obtained in the first half of the paper. As 
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described above, there are several studies of sparse estimation in VAR models. Neverthe¬ 
less, our result is not just a replication of those papers—it is an entirely new addition to the 
literature. First, contrary to a number of papers, our analysis allows the dimension and lag 
order to diverge, similar to Basu and Michailidis (2013) and Kock and Callot (2014). Thus, 
high-dimensional VAR models can be handled within our framework. Second, the errors 
are assumed to have only finite fourth moments, as opposed to the two papers just men¬ 
tioned, in which the authors supposed Gaussian errors. This assumption is essential to their 
contributions and proofs, which emphasizes the difference of this paper. Third, our result 
includes many kinds of penalties, such as both SCAD and Lasso, while all the above papers 
employed only Lasso-type penalties. In addition, it is noteworthy that we can include anal¬ 
yses of both ordinary least squares (OLS) and generalized LS (GLS) estimations, thanks to 
the utilization of quasi-likelihood functions. In the well-known example of Zellner (1962), 
the OLS estimator is shown to be identical to the GLS estimator in an unrestricted regres¬ 
sion in VAR models. However, when it comes to penalized estimation, a difference arises 
between these two methods in VAR estimation under parameter constraints. Note, though, 
that our developing theory is not limited to VAR estimation. Another possibility, such as an 
application to MGARCH, is briefly discussed in the end of this paper as well. 

After verifying the oracle property of the PQML estimator in a high-dimensional VAR 
model, we proceed to a simulation study and give an empirical example. We then observe 
from a simulation study that the PQML estimation performs quite well compared to an 
unrestricted QML. At the same time, non-concave penalties, such as the SCAD, are observed 
to perform better than the Li-penalty. Finally, the validity is further confirmed in terms 
of forecasting accuracy of the U.S. yield curve. The performance is measured through a 
comparison with the dynamic Nelson-Siegel model proposed by Diebold and Li (2006), 
which is representative of what exploits factor structures and is known to be superior among 
many other forecasting strategies, including simple VAR(l) and univariate AR predictions; 
see Diebold and Li (2006). 
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The remainder of the paper is organized as follows. In Seotion|2l we first define the 
model and PQML estimator, and then we introduee some assumptions required to derive 
asymptotic properties. Section [3] gives the main theoretical results; the weak oracle and 
oracle properties of the estimator are established. The results are applied to estimation of 
VAR models in Section IH The finite sample performance of the VAR model estimation is 
investigated in Section[51 We also observe the usefulness of the method in real data analysis 
compared to another method in Section Section |7] provides a summary of the paper and 
a discussion of further studies, including sparse MGARCH estimation. All the proofs and 
some lemmas are collected in the Appendices. 

We conclude this section with the introduction of some notation. For some vector x 
and matrix A, these ith and ijth elements are written as Xi and Aij, respectively. The jth 
column (ith row) vector of A is similarly denoted asA.j (A;.). A min (A) and Aniax(A) mean the 
minimum and maximum eigenvalues of A, respectively. ||x|| is the Euclidean norm. ||x||cx, is 
the largest element of x in modulus. ||A|| represents the spectral norm, i.e., a square root of 
Amax(ATA). ||A|| oo refers to the operator norm induced by ||x||cx„ or the largest absolute row 
sum. 

2 Model and assumptions 

In this section, we introduce the model and assumptions required for deriving the theoretical 
results. Our model extends penalized likelihood to general time series models and allows 
high dimensionality, as well as the opportunity to eliminate an i.i.d. assumption on the data. 
In Section 12.1[ the model and PQML estimator are defined. Regarding the objective func¬ 
tion, the penalty functions and log-likelihood function are considered in Sections 12.21 and 
12.3[ respectively. 


6 







2.1 Model setup 


Consider a real vector, stationary, and ergodic time series {y,} for t = 1,..., T. This process 
is assumed -measurable with anatural filtration := O'(Tf), where Yt = ■ ■ ,yt--r^i) 

for some fixed r > 0 or T, = The former case corresponds to VAR(r) mod¬ 

els and the latter to MGARCH models, for example. Let yt have a continuous conditional 
density g{yt\Yt^\). However, g is usually unknown, so we must postulate a parametric fam¬ 
ily of measurable density functions, {f{yt\Yt^i : 0) : 0 E & C M^}, that may or may not 
include the true density g. The parameter vector 0 is p-dimensional, and its “true value” 

0® G int(0) is naturally defined as the unique minimizer of the Kullback-Leibler divergence 
of g relative to /. To get the estimator, we first define the quasi-log-likelihood function as 
Cr( 0 ) = where f,( 0 ) =log/(y,|T,_i : 0 ). 

Regarding the parameter vector 0®, we consider the case where the dimension p di¬ 
verges at the polynomial rate p = 0{T^) for some 5 > 0 , whereas the number of included 
nonzero elements, q (< p), diverges at the slower polynomial rate q = 0{T^) for some 
(5 >) Sq > 0. Therefore, the vector 0® is understood to be filled with many (p — q) zeros. 

It may be possible to allow exponentially diverging parameters, as in Fan and Lv (2011), by 
using a concentration inequality with exponentially decreasing bounds (such as the Azuma- 
Hoeffding inequality) even if the process is not i.i.d. Nevertheless, we do not take such an 
approach, because it is impossible to obtain such a sharp inequality without imposing addi¬ 
tional assumptions in the process. Intuitively, a fast diverging rate is achieved at the cost of 
restricting the classes of processes. 

In order to make the theory clear, the parameter vector 0® = (01*,..., 0®)^ should be 
decomposed into two subvectors. Let denote the set of indices {y G {1,..., p} : 0® 7 ^ 0} 
and 0 ^^ be the (^-dimensional vector composed of the nonzero elements { 0 ® : J G ^q}. 
Similarly, we define 0^c as the {p — -dimensional zero vector. Without loss of generality, 
the vector is stacked like 0 ® = ( 0 ^, 
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In this paper, since the dimension p diverges as T tends to infinity, we consider the 
penalized quasi-maximum likelihood (PQML) estimation of 0® in order to reduce irrelevant 
variables. Let Pt{B) = 'Tfj=iPx{\Bj\) be the penalty term that brings sparse estimation. The 
objective function of the PQML estimation is then defined as 


Qt{0) = Lt{0) — Pt{,B) ■ 


( 1 ) 


The penalty function px is such as the Lp-penalty, with 0 < p < 1, by Frank and Fried¬ 
man (1993), the Li-penalty (Lasso) by Tibshirani (1996), the SCAD penalty by Fan and 
Li (2001), the hierarchical penalty by Bickel et al. (2008), or the minimax concave penalty 
(MCP) by Zhang (2010). The tuning parameter A(= Xj) determines the size of the model. 
We let X = 0(r^“), where a positive number a is specified later. The PQML estimator, 0, 
is defined by Qt{B) = rmxQ^@QT{d). 

2.2 Penalty function 

In this subsection, we start by introducing some notation and discussing the properties and 
assumptions of some types of the penalty functions in ([T]). Define half of the minimum signal 
SiS d{= dr) = min^g^g \0j\/2, and letNo = eW : \\e^^-e^J\oo < d}. Following 
Lv and Fan (2009), we let p{x;X) = px (x)/X and define the local concavity of p at x G M'" 
with ||x||o = r as 



where A^j := {\xj\ — e, \xj\ -I- e). If the second derivative of p(-; A) is continuous, we may 
easily see that K{p;x) = maxi<y<,.—p"(|xj|;A). We sometimes drop X and write them 
simply as p(x) and k{x). The next assumption, which was first introduced by Lv and Fan 
(2009) and was also used in Fan and Lv (2011), characterizes a broad class of penalties. 

Assumption 1 The penalty function p(-; A) is increasing and concave on [0,°°), and it has 
a continuous derivative p'(-;A) with p'(0-l-;A) > 0. In addition, p'{t;-) is increasing on 


8 



(0,oo), and p'(0+;A) is independent of A. 


Assumption 2 Asupe^g^o = o(l)- 

Assumption[T]iniplies k{x;X) > 0 by eoneavity of p and is key to Lemnia[T]in Appendix A. 1. 
Assumption [2] is required to verify a suffieient eondition of Lemma [B whieh furthermore 
leads to the main theorem. 

Fan and Li (2001) advocated penalty functions that endow estimators with three desir¬ 
able properties: unbiasedness, sparsity, and continuity. It is known that the SCAD satisfies 
all of them simultaneously, but the Li and MCP fail to exhibit unbiasedness and continuity, 
respectively. All three penalties, however, satisfy Assumption[T] 

We further give two sets of conditions on the penalty function px to restrict the general 
class of penalties given by Assumption [TJ The first one is used to obtain the so-called 
weak oracle property of the PQML estimator 0 in the next section. This idea was first 
introduced by Lv and Fan (2009) and also obtained in Fan and Lv (2011). The second set 
of conditions is required to achieve the oracle property of 0, a stronger result. (The role of 
positive constants mi and /3 below are understood in the context of As sumption [5] in the next 
subsection.) 

Assumption 3 The penalty function px satisfies the following properties: 

(a) A = 0{T^^) for some a E (0,1/2 —5o/mi —/3); 

(b) d > T^'^logT for some yE (0,1/2] and large T ; 

(c) Ap'(J) =o(^-i/2j-7iogr). 

Assumption 4 The penalty function px satisfies the following properties: 

(a) d/X -E- °° and A = 0{T^^) for some a E (0,1/2 —So/2 —J3); 

(bl) Ap'(J) = 6>(r-i/2). 
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(b2) Ap'(J)=o((^r)-i/2). 


Assumption!?] is stronger than Assumption^ enough to exelude the Li-penalty, whieh is on 
a boundary of Assumption [TJ For a SCAD or MCP, p'^{d) beeomes exaetly zero for a suf¬ 
ficiently large T under Assumption Ufa), implying that|ltb2) andUbl) hold automatically. 
For the Li-penalty, however, p'{d) = 1 holds identically, which indicates that this penalty 
fails to simultaneously satisfyOa) andjUbl). Meanwhile, the Li-penalty can be included in 
the class given by Assumption [3l the condition a > 5o -f 7 is necessary for the Li-penalty 
to satisfy Assumption [3t a) and (c) simultaneously. These features may be observed in the 
following example. 

Example 1 (a) The Li-penalty is given by px{x) =X\x\, and we then obtain p^(|x|) = A 

and pI{\x\)=0. 

(b) The SCAD penalty is characterized by its derivative 

for somea > 2. Then we have p'^{\x\) = —{a— l)“^l{|x| G (A,aA)}. 

(c) The MCP is defined through its derivative = a^^{aX — x)+ for some a > 1. 

Thus, we have p^(|x|) = l{|x| <aX}. 

2.3 Likelihood function 

We introduce some notation and assumptions on the likelihood function Lj. Hereafter, the 
likelihood Lt is supposed to be twice continuously differentiable in a neighborhood 0® C 0 
of 0®. Define the averages of the score vector and Hessian matrix as 

T T 

5r(0) = r-i^^,( 0 ) and //r( 0 ) = 

f=i f=i 

where 5 f( 0 ) = d£t{6)ld0 and ht{6) = d^£ti0)/d6d0^ . For the development of some re¬ 
sults later on, we need to define the averages of the “score subvector” S and the 
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“Hessian submatrix” H^Qx{d), by way of setting s^Qt{6) = d£t{6)jd6^^ (i.e., the sub- 
veetor of 5;(0) that is eomposed of its first q elements) and = d'^£t{9)l 

(i.e., the upper-left q x q submatrix of ht{6)). We also define Sj^cj[Q) analogously. In 
what follows, we oeeasionally suppress the argument (0) when it is evaluated at 0® = 
( 0 ^^, 0); for example, we will simply denote St{9^) = St{9 ^^^, 0) as S®. Define, moreover, 
1j<^ot{ 9) ' = ^[T~'^Yj=iS^Qt{9)sl^^^{9)],Ij(Q :=\imT^ooI^QT,J^oT{9) '■= -E[H^qt{9)], 
and := To derive the theoretieal result, the seore and Hessian require 

some assumptions. Here, we introduee two sets of eonditions. They are used to aehieve the 
weak oraele and oraele properties, respeetively, in the next seetion. The first one is neeessary 
for the weak oraele property. 

Assumption 5 The (quasi-)log-likelihood funetion Lj satisfies the following eonditions: 

(a) For all i G and T, E < oo holds for some mi > 0. 

(b) For all i G and T, E 1™^ < oo holds for some m 2 > 0. 

(e) For all T, is a.s. positive definite, and Aniin(—=: Ci = Op{l). 

(d) There is a neighborhood 0*^^ C 0 of 0*^^ sueh that 

\\H^qt{9\,0) (02,0)11 < Kt\\9\ - 02||cx, 

holds for all 0i, 02 G 0^^ and for some Kp = Op{\). 

(e) There is a neighborhood 0j^^ C 0 of 0^^ sueh that 

sup |||(a/aeT^)s,,..r(e,. 0 )][H,,.„r(e 2 . 0 )i-'||„< 

for some c = Op{\) that takes its value in (0,1) a.s. and some eonstant /3 G [0,1/2). 

Assumption[5l[a) is needed for just a teehnieal reason, but Assumption[5l[b) is meaningful. A 
higher moment eondition in (b) allows estimation of a larger dimensional parameter veetor. 
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If the dimension p is large enough to exeeed the sample size T, the number m 2 is required to 
be strietly greater than two. When a large m,- (i = 1,2) is eoneerned, it seems eumbersome 
to verify these eonditions. However, Lemma [2] in Appendix A.l is useful, as long as the 
seore S obeys a martingale proeess. Assumption Oe) means that is a.s. 

positive and bounded away from zero, and not diverges. If the model is linear in parameter, 
the Hessian does not depend on the parameter, so that As sumption Od) automatieally holds. 
A similar eondition is found in Wooldridge (1994, Theorem 8.1). AssumptionOe) is similar 
to eondition (16) of Fan and Lv (2011). The left-hand side is regarded as a regression 
eoeffieient of eaeh irrelevant variable on important variables in the ease of linear models. 
The right-hand side ean diverge when a folded-eoneave penalty is eoneerned, but the upper 
bound beeomes more restrietive when the Li-penalty is used sinee the ratio p'{0-\-)/p'{d) 
always beeomes one. 

To derive the oraele property, we need a slightly different set of eonditions. For any 
matrix A and veetor x sueh that Ax is well-defined, let ||A|| 2 ,oo := max||;^||=i ||Ax||oo. 

Assumption 6 The (quasi-)log-likelihood funetion Lj satisfies Assumption [5] (b), (e), and 
the following eonditions: 

(d) There is a neighborhood C 0 of 6^^ sueh that 

\\H^oT{ei,o)-H^^T{e2,o)\\ <KTm-e2\\ 

holds for all 0 i, 62 ^ some Kt = Op{l). 

(e) There is a neighborhood 0*^^ C 0 of 6 *^^ sueh that, for some /3 G [0, 0 °), 

sup \\(3lse\)s^^T(eum2.~ = Op(r'*). 

The role of Assumption [^d) is the same as that of Assumption [Sj^d). Condition (e) restriets 
the asymptotie behavior of the lower-left {p — q) x q submatrix of Ht{0) and is similar to 
eondition (27) of Fan and Lv (2011). 
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The next assumption is required to obtain asymptotie normality of the estimator with the 
oraele property, in addition to As sumption 0 

Assumption 7 The (quasi-)log-likelihood funetion Lj satisfies the following conditions: 

(a) For all T, is positive definite and Amax(-^^g 7 ’) < Cf some constant C 2 < °o. 

(b) For any vector a G such that ||a|| = 1, the score vector admits a central limit theo- 

In many cases, a score vector obeys the martingale process, so that a central limit theorem for 
a strictly stationary and ergodic martingale difference sequence is available for Assumption 
|7];b) under general conditions. 

3 Theoretical results 

In this section, we establish the oracle property. This property states that the PQML estima¬ 
tor is asymptotically equivalent to the QML estimator that is obtained with the correct zero 
restrictions. The first theorem is the weaker version of the oracle property. A similar result 
was first obtained by Lv and Fan (2009) and also Fan and Lv (2011) under i.i.d. conditions. 

Theorem 1 (Weak oracle property) Suppose that AssumptionsU} \2l |21 and\^hold. If p = 
0{T^) and q = 0{T^) satisfy the conditions 

5 G[0,m2(l/2 -a)), 5oG[0,(l/2-7)/(l/2 + l/mi)], 5oG[0,r), 

then there exists a local maximizer 0 = of Qt{B) such that the following 

properties are satisfied: 

(a) (Sparsity) d^c — 0 with probability approaching one; 

(b) (Rate of convergence) \\e^^ - 0^J|oc = 6>p(r~^logr). 
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Theorem [T] is somewhat different from the weak oracle property developed by Fan and Lv 
(2011) in that ours is based on the asymptotic result. Also, Theorem [T] gives an opportunity 
for the PQML estimator to achieve model selection consistency even when the Li-penalty 
is employed. To see this, let denote the objective function with px given by the 

Li-penalty. 

Corollary 1 (Li -penalized QML estimator) Suppose all the assumptions in Theorem \T} 
hold. Then there exists a local maximizer 9 = (0^^ of such that the prop¬ 

erties ofTheorem\T\(a) and (b) hold. 

If the (quasi-)likelihood function is globally concave, then the objective function Qi^n is as 
well because the Li-penalty is globally concave. Hence, the local maximizer is extended to 
the unique global one in this case. 

If we employ a SCAD-type penalty, a stronger and more desirable result can be obtained. 
This result is called the oracle property, as studied by Fan and Li (2001). 

Theorem 2 (Oracle property) Suppose that Assumptions \I\^^a)(bl), and hold. If 
5 < m 2 (1/2 — a) is true, then there exists a local maximizer 9 = (0^^, of Qt (9) 

such that the following properties are satisfied: 

(a) (Sparsity) 9^£c — 0 with probability approaching one; 

(b) (Rate of convergence) \\9^^ - 9^J\ = Op((q/T)^/^). 

In addition, suppose Assumption^ holds. If Assumption^bl ) is strengthened with^bl) 
and 5 q < 1/2, then, for any vector a G that satisfies ||a|| = 1, we have: 

(c) (Asymptotic normality) l‘^p^^H'^p(9^^ - 0^J A(0,1). 

This property means that the model selection is consistent in the sense that 0 — 0 with 
probability approaching one, and the estimator has the same asymptotic efficiency as the 
(infeasible) MLE obtained with advance knowledge of the true submodel. Thanks to the 
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property, we ean estimate high-dimensional models without irksome tests for zero restric¬ 
tions on the parameters. 

As is described in Section 12.21 a SCAD-type penalty automatically satisfies not only 
Assumption nth 1) but also Assumption |4l;b2), as long as we choose an adequate A that 
satisfies Assumption a) with sufficiently large T. Again, the Li-penalty fails to satisfy 
Assumption m 

When a fixed q is supposed, result (c) reduces to 

-eX) ^N (o. . (2) 

This asymptotic covariance matrix of the estimator can be consistently estimated by a stan¬ 
dard procedure, as in Wooldridge (1994, Sec. 4.5). Specifically, it is given by 
say, where Jt is an averaged Hessian matrix Ht{B) evaluated by 0, and It is a conventional 
long-run variance estimator of made of a score vector St{B) evaluated by B. 

If the conditional density g{yt\Yt) depends on the infinitely past observations, or Yf = 
(jf, yf_i,...), it is necessary to replace them with some fixed initial values, say %, to perform 
the optimization. We denote it as the it with density conditional on %. The maximizer of the 
objective function based on it, instead of it, is then asymptotically equivalent to B provided 
that 

lim sup \it{B) — If(0)1 = 0, 

in probability. Such a refinement for infinite initial values is typically made in estimation 
of (multivariate) GARCH models; see Section 4.2 of Comte and Lieberman (2003), for 
instance. (See also Hafner and Preminger, 2009 and Ling and McAleer, 2010.) 

4 Application to VAR 

So far, we have studied a general method of PQML estimation for high-dimensional time 
series. In this section, we see how the theory can be applied to estimation of a large di- 
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mensional VAR(r) model. There are several works on estimation in high-dimensional VAR 
models, ineluding reeent works by Basu and Miehailidis (2013) and Kook and Callot (2014). 
However, we will see that our result here is not just a replioation of those studies but is en¬ 
tirely novel to the literature. First, our result requires only a finite fourth moment for the 
error term, while they supposed Gaussian errors. This assumption is essential to their results 
as it admits the dimension p to grow non-polynomially fast; but Gaussian VAR models are 
sometimes too restriotive to capture various economic behavior. On the contrary, we will 
keep the generality of VAR models, including non-Gaussian ones; nevertheless, dimension 
p grows at a polynomial rate. Second, our result can employ many non-concave penal¬ 
ties other than the Lasso, which broadens the opportunities for empirical analysis. To the 
best of our knowledge, all the research on high-dimensional VAR estimation is based on 
Lasso-type procedures. Third, our strategy includes both OLS and GLS estimations, since 
quasi-likelihood is permitted. This will briefly be considered at the end of this section. 

4.1 VAR model and assumptions 

Consider a centered vector process yt characterized by the following k-dimensional VAR(r) 
model: 

y, = 4>°"^jc,-ber, t = l,...,r, (3) 

where = (d>j,... ,4>°) is the parameter matrix, Xt = (y^i, ■ ■ -^yJ^rY, and £t is an error 
term with mean zero and finite positive definite covariance matrix Eg. A constant vector may 
also be included in model ([3]), but we omit it to keep the discussion clear. The autoregressive 
order r and the dimension k are allowed to increase as T tends to infinity. More precisely, 
if we let 0° = vec(d>®^) G with p := k^r, the dimension p may diverge with the rate 
p = 0{T^) for some 5 > 0. The non-sparsity dimension in 0®, q (< p), is also assumed to 
possibly diverge at the rate q = 0{T^) for some 5o > 0. Define, furthermore, Yt = {yt^xJY 
and = <7(Ff). 
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Given Yq, the quasi-Gaussian log-likelihood function (up to a constant term) is given by 

LT{e) = T-^Zj=iit{e)with 

®lk)9^ |yf-(v:7«)4)0}, (4) 

where E is a finite and positive definite weighting matrix and 4 is the kxk identity matrix. 
The PQML estimator is then obtained through ([T]) and dH) with a SCAD-type penalty that 
satisfies Assumptions [U and IH Note that the matrix E is determined by a researcher; if one 
defines E = 4, it leads to the OLS, and if one sets E = Eg, though it is usually unknown, it 
would reduce to the infeasible GLS. 

In order to observe the theoretical result clearly, we introduce a permutation matrix P\ 
that is, P is a matrix with a single one in each row and column and all other elements zero, 
satisfying 9^ = (0^,0^)^. Since any permutation matrix P satisfies the orthogonality 
condition PP^ = Ip, we can put PP^ between {xj <8)4) and 9 in (jH). With simple algebra, 
the corresponding score and Hessian evaluated at 0 = 0° are then given by 

= P^{xt 0E^^and /z® = —P^{xtxj ®'L^^)P, 


respectively. Therefore, we can always find s^, whose elements are ordered like 


by employing some P. We further let = {Iq, Oq^p-q) and K^c ^ {Op-q^q,Ip-q), where 


T 


Oa,b is the ax b zero matrix, and define P'^ := P ' iqxp) and P^c := K]^cP ' {{p 


T nT 


jT 


-T nT 


q) X p). Then there exists a P (and hence P^^) such that the following holds: 


= = . 


iT 


h\t = K\h^K\ = -P4(W 0E-1)4^„. 


(5) 

( 6 ) 


Similarly, we can define and by using Pj^c instead of 4#^ in ([5]) and ®. Notice 
that this manipulation is no more than an auxiliary routine to promote understanding of the 
following proposition. Thus, it is not necessary to specify such a P in an empirical analysis. 

To utilize the results obtained in Section [3l we constrain model dU) to satisfy the next 
assumption. 
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Assumption 8 For each k and r, the following conditions hold: 

(a) The error term £; = (cif,..., is a sequence of i.i.d. random vectors having a 
continuous distribution with 'E\£^t£i 2 t^i-it£i^t\ < Ce for some constant Ce < and for 
all t and ii,Z 2 ,* 3,4 £ {1, • ■ ■ 

(b) For all k and r, det(4 — - ^ 0 for |z| < 1 . 

(c) For each £ G {1,..., A:}, all the elements of the row vector are zero except the 

finite numbers of them. The number does not depend on k. 

As sumption [8t a) is standard in stationary multivariate time series analysis to achieve asymp¬ 
totic results; see Liitkepohl (2005, p. 73). This condition is also used to verify Assumption 
[5];a)(b) with mi = m 2 = 4. A higher moment condition is, of course, required if one wants to 
make Assumption [5ta)(b) hold with mi and m 2 larger than four. Assumption [8tb) is essen¬ 
tial to ensure that model dU) has no unit root and is stable, in the sense of Liitkepohl (2005, 
p. 15). This condition together with Assumption [5t a) guarantees stationarity of the model. 
Assumption [8];c) restricts the asymptotic behavior towards k’s direction for some technical 
reasons. 

4.2 Result 

Now we have made the preparations to derive the oracle property for the PQML estimator 
of VAR model dS]). 

Proposition 1 Suppose that Assumptions\T\ |?] and^are satisfied. Then all the conditions 
in Assumptions^and^are true with mi = m 2 = 4 and fi = 5 q/ 2, and the result of Theorem 
^holdsfor PQML estimator 0 of model ^ if 5 < 4(1/2 — a). 

We treat < 2 ' as a finite number in Proposition [T] to construe the subsequent asymptotic 
result. The asymptotic distribution is then given by d2l) with 7^^ = (8)E^^EeE^^)P^g 

= P\ (r 0 E -1 )P^g, where F = E [xjxJ]. 
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In a multiple equation model, as was made well known by Zellner (1962), the OLS 
and GLS estimators are identieal as long as the regressors are the same in all equations. 
The typieal example is a VAR model. However, the assumption eollapses onee parameter 
restrietions are eonstrained, and these estimators do not beeome identieal; see Liitkepohl 
(2005, See. 3.2.1 and See. 5.2). Proposition[T]is a eomparable result. lip> q, the asymptotie 
varianee of 9^^^ is found to be 

Henee, if E is set to 7 or Eg in (|7]), we have 

(Pk(r®fl'Vo)”‘fi(r®Ee)f.,«-„(Pi,(r®/)7V,)-‘ if i = /; 

(^’i(r®Ej‘)f’,#o)“‘ if E = E£. 

On the other hand, when we eonsider the ease p = q, we have = Iq and henee = P, 
whieh eorresponds to the ease where no penalty is eonstrained. With simple algebra, it is 
easy to see that (|7]) reduees to 7’^(r~^ ®'Le)P, meaning that the resulting estimators are the 
same, notwithstanding the ehoiee of E. This signifies that the PQML method is essentially 
the same as the parameter-restrieted QML; a point of differenee is that the parameters’ 
eonstraint to zero is automatieally imposed. 

5 Simulation study 

In this seetion, we eheek the estimation aeeuraey of the PQML estimates of the VAR model 
analyzed in Seetion HI VAR model dU) is now speeified with k = 8, r = 2, and parameter 
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matrices given by 
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0> 


We may easily eonfirm that this formulation entails Assumption [8l The error term £/ is 
assumed to be distributed as a serially uneorrelated multivariate normal N{0, Eg), where the 
eovarianee matrix Eg is speeified through deeomposition UU~^, with 

.1 0 0 0 0 0 0 ^ 

0.3 0 0 0000 

00.9 0 0000 

0 0 .2 .4 0 0 0 0 

0 0 0 -.2 .3 0 0 0 

000 0 0.3 00 

000 0 00.3 0 

000 0 0 0 0 .3J 


U = 


Validity of the PQML estimation is measured by comparison with other methods, in¬ 
cluding the usual unrestrieted and eorreetly zero-restrieted QML estimations; the former 
appears to have ineffieient estimates, and the latter, though infeasible, is expeeted to result 
in highly effieient estimation. All the strategies are based on the QML in the sense that 
E in Q is given by /g. That is, the unrestrieted and restrieted QML estimates agree with 
the unrestrieted and restrieted OLS estimates, respeetively. The penalty terms in the PQML 
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estimations include the SCAD, MCP, and Lasso, and the tuning parameter a is set to 2.5 
and 20 for SCAD and 1.5 and 20 for MCP. These PQML estimates are computed using the 
coordinate descent algorithm provided by the R package ncvreg; see Breheny and Huang 
(2011) for detail. Note, however, that the package automatically includes the intercept in 
the model, so that the code is adjusted to meet the present no-intercept model. We must also 
choose a tuning parameter A. Since we have not revealed an optimal way to do so, we use 
the conventional method of selection by 10-fold cross-validation, which leaves out 1/10 of 
the data. If the data were i.i.d., the issues on tuning parameter selection could be addressed 
by several methods, elaborated in other works; see Fan and Tang (2013) and the references 
therein. 

Estimation accuracy is checked by observing three measures: a root mean squared error 
(RMSE), standard deviation (STDEV) and success rate on model selection, and sign con¬ 
sistency (MSSC) for PQML estimates. These statistics are constructed on the basis of 100, 
300, 500, and 1,000 observations, with 1,000 replications each. As the parameter vector 
vec (4>o,4>0) is large (128-dimensional), it is difficult to display all the results. Hence, they 
are evaluated by the distance between vectors measured by the L 2 -norm. 

The simulation results are summarized in Table [IJ At first glance, the three penalized 
estimations greatly succeed in gaining efficiency in regard to RMSE and STDEV, compared 
to the conventional MEE. In addition, these penalized estimates possess high success rates 
for model selection (MSSC), except when nonzero entries are estimated with n = 100 and 
small a’s. In this case, some nonzeros are misconstrued as zeros. Such a defective behavior 
is, however, eliminated immediately as the sample size gets larger. Taking a closer look at 
the columns of SCAD and MCP, we notice that, for small a, MSSC rates tend to be better 
and biases are reduced in exchange for sacrificing, to some extent, the values of RMSE and 
STDEV; and, for large a, the reverse behavior is observed. Although the Easso exhibits 
a good performance in terms of RMSE and STDEV, especially in the cases of moderate 
sample size, it seems to cause a slightly larger bias relative to the SCAD and MCP, which is a 
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well-known phenomenon in the literature; see Fan and Li (2001), for instance. Moreover, the 
MSSC rate of the Lasso is not improved even when the number of observations is increased. 


6 Real data analysis 


The usefulness of the proposed sparse VAR (sVAR) estimation method is clearly observed 
when looking at its performance in forecasting the term structures of government bond 
yields. We use zero-coupon government bond yields of the U.S. at a monthly frequency 
from January 1986 to December 2007 with 8 maturities: 3, 6, 12, 24, 36, 60, 84, and 120 
months. This dataset is constructed by Wright (2011) and can be downloaded from his web¬ 
site. The performance is confirmed by comparing the out-of-sample forecasting accuracy of 
our method to that obtained from the dynamic Nelson-Siegel (DNS) model accommodated 
by Diebold and Li (2006), as it is a representative model utilizing a factor structure to reduce 
the high-dimensionality (it is briefly explained later). Both the models are estimated recur¬ 
sively, using data from January 1986 to the time that the /z-month-ahead forecast is made, 
beginning in January 2001 and extending through December 2007. Therefore, we obtain 
12 — h-\-\ forecast values for each model. The forecasting horizon, h, is given by 1,3, 6, 
and 12 months, respectively. The accuracy of the forecast is measured by the RMSE relative 
to the actually observed yields. 

Now we explain the two models, DNS and sVAR. The first model, DNS, has achieved 
dimensionality reduction via factor structure of yields and is well-known in the economet¬ 
ric literature for its superiority in forecasting; see, for example, the empirical examples in 
Diebold and Li (2006) and Kargin and Onatski (2008). It is a suitable competitor for sVAR, 
since it has the best performance among several models, including simple time series models 
like VAR(l). DNS models yield y^t of maturity T at time t as 


Jn — Pit + Pit 


V vt^ ) 


+ P3t 


V itT 
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The coefficients Pu, Pit, and Pit niay be interpreted as latent dynamic factors corresponding 
to the level, slope, and curvature, respectively, because of the construction of their load¬ 
ings. rjt is a sequence of tuning parameters. By the construction of the model, prediction 
of the yields is equivalent to that of the /3’s in the sense that estimated yields are obtained 
immediately once /3’s are determined. For more information about the model and its vari¬ 
ants, see Diebold and Rudebusch (2013). According to a recommendation of Diebold and 
Li (2006), rjt is determined by maximizing the loading on the medium-term factor for each 
period (although Diebold and Li, 2006 fixed the value at 0.0609 for simplicity); see Diebold 
and Li (2006, pp. 346-347) and Guirreri (2010, p. 9). The strategy of parameter estimation 
is basically the same as in Diebold and Li (2006) and is implemented using the R package 
YieldCurve. These estimated parameters are modeled as univariate AR models, with Pn 
regressing on 1 and Pi^t-h for 1 = 1,2,3. We then obtain /z-step-ahead forecasts Pi^t+h by Pit. 

The second model, sVAR, demonstrates the difference of yields directly as a VAR model 
of lag order 12, with the assumption that the coefficient matrices are sparse. The model is 
estimated with a SCAD-penalized QML, reducing the high-dimensionality and performing 
model selection. The computation is done with the aid of the R package ncvreg, as in the 
preceding section. In this case, however, the tuning parameter A is simply set to (8r)^®'"^/4 
because cross-validation for the present large model, which has 8^ x 12 = 768 coefficient pa¬ 
rameters of interest, is extremely time-consuming. The forecasting strategy is quite simple: 
differenced yields are predicted recursively over /z = 1 to 12, with the estimated parameters 
fixed at each time t that the forecast is made. It should be noted that we do not compare to 
a simpler and more parsimonious model such as VAR(l), since it has already been reported 
to be inferior to the DNS model in Diebold and Li (2006). 

The results are summarized in Tables [2] and [H The former reports RMSEs of out-of- 
sample predicted yields (relative to the actual ones) obtained by each model, and the lat¬ 
ter shows the ratios of RMSEs-by-sVAR to RMSEs-by-DNS. The impact of the results is 
highly impressive and straightforward: that is, the superiority of the sVAR model can easily 
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be observed. In fact, RMSEs-by-sVAR are uniformly lower than those by DNS in all the 
maturities T for each prediction period h. Although there are many versions that improve 
the original DNS, sVAR is undoubtedly a strong competitor in terms of forecasting. 

7 Conclusions 

We present our conclusions in two parts. First, we consider another potential application 
of our method in the estimation of high-dimensional MGARCH in Section ItTI and Section 
17.21 presents concluding remarks. 

7.1 Another potential application 

Regarding the application of the general theory, this paper has focused only on a VAR model, 
in spite of the potential applicability to the other stationary models of high-dimensionality. 
For instance, we may apply the same theory to estimation of high- (or even moderate-) 
dimensional MGARCH models. 

Although there are several variants of MGARCH, the so-called BFKK(1,1) model is 
known to be comparatively general among them. This particular model supposes that the 
conditional variance E; of a k-dimensional stationary process yt is modeled by 


The parameter Cis akxk lower triangle matrix and A and B axe kxk matrices which vary 
freely, so that the number of parameters to be estimated amounts to (5k-|- l)k/2. One char¬ 
acteristic that helps avoid estimating many parameters is that the model is made to specify 
parameter restrictions in advance. For example, constant conditional correlation GARCH 
restricts the parameters to make the model possess constant correlations, while modeling 
the conditional variances by univariate GARCH models; see, for example, Bauwens et al. 
(2006), Silvennoinen and Terasvirta (2009), and Francq and Zakoian (2010). However, such 
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modeling may be too restrictive in that the parameter constraint is imposed somewhat artifi¬ 
cially. 

Of course, while there are many improved models in that direction, it seems more ap¬ 
propriate that the effective parameters be selected by observed data in a general model such 
as BEKK(1,1). This direction could be pursued with the method proposed in the paper. In 
such a case, we must take care when imposing penalties on the parameters. If all the param¬ 
eters are equally constrained, it might be possible that some diagonal elements of Ef, which 
should not be zero, might be forced into becoming so. 

7.2 Concluding remarks 

In this paper, we have first developed a general framework of the penalized quasi-maximum 
likelihood (PQML) estimation that results in sparse estimates of high-dimensional parame¬ 
ter vectors in stationary time series models. As far as theoretical results, the so-called oracle 
properties of the estimators are explored under the assumption that the number of parameters 
of interest grows at a polynomial rate. It should be underscored that SCAD-type penalties 
are recommended instead of the L\ penalty, which is frequently used in the time series con¬ 
text, because of the bias reduction. The theoretical results in the first half of the paper have 
been derived under the assumption of high-level conditions, and hence a researcher should 
verify the conditions for each model he/she wants to consider when using this approach. 

Second, large but sparse VAR models have been examined as an application of the results 
achieved in the first half. Specifically, assumptions under which the oracle property holds 
have been suggested. We may hereby avoid irksome tests for zero restrictions of coefficient 
matrices and accept instead a high-dimensional VAR model, as long as the model is sparse. 
The validity and superiority has been confirmed through a Monte Carlo simulation and an 
empirical example. In particular, we have observed from the empirical analysis that, even 
when the dimension is large, the PQML estimation sufficiently reduces the innate trouble in 
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VAR models, which can often suffer terribly from high-dimensionality. 

It goes without saying that there are many issues that must be addressed in the future. In 
particular, there is a limitation in regards to the use of the tuning parameter A. In the case 
of the i.i.d. assumption, much attention has been paid to how to determine A; see Fan and 
Tang (2013) and references therein. However, this paper, which has treated dependent data, 
has presented no guideline on the choice of “good” tuning parameters A (and a). Future 
research is necessary to indicate the best way to set tuning parameters in empirical studies. 
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A Appendices 


A.l Useful lemmas 


In Lemma [U below, let ^ := supp(0), a set of indices corresponding to all nonzero com- 
ponents of 0, and 6 £ denote a subvector of 0 formed by its restriction to The other 
symbols are defined analogously. Let 0 denote the Hadamard product. The sign function 
sgn(-) is applied coordinate-wise. 

__ A 

Lemma 1 Suppose that Assumption [7] holds. Then 0 is a strict local maximizer of the 
penalized quasi-likelihood (PQL) 2r(0) defined in ([7]) if 


s^Ti^)-^Tp\\e^\)Qsgn{e^fi=o-, 

ll°° ^ (0-1-); 


( 8 ) 

(9) 

( 10 ) 


Proof The proof follows from Fan and Lv (2011). We first consider the PQL Qt{0) defined 
in dU) in the constrained ||0||o-dimensional subspace S := {0 G : 0“^ = 0} of where 
0*^ denotes the sub vector of 0 formed by components in . It follows from condition (fTOl) 
that Qt{0) is strictly concave in a ball Mq C S centered at 0. This, along with dS]), entails 
that 0, as a critical point of 27’(0) in S, is the unique maximizer of Qt{B) in Mo¬ 
lt remains to show that the sparse vector 0 is indeed a strict local maximizer of Qt{B) 
on the whole space Take a small ball Ni C centered at 0 such that Mi fiS C Mq. We 
then need to show that Qt{B) > 2r(7i) for any 7 i G Mi \ Mq. Let 72 be the projection of 71 

A A A 

onto S, so that 72 G Mq. This ensures that Qt\Q) > Qt{Y 2 ) unless 72 = 0, since 0 is a strict 
maximizer of Qt{0) in Mq. Thus it suffices to prove 2r(72) > 2 r( 7 i)- 
By the mean value theorem, we have 

2 r( 7 i) - 27 ’( 72 ) = (71 -72), 
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where the vector 70 lies between y\ and 72 . Note that the components of 71 — 72 are zero for 
their indices in ^ and sgn( 7 oy) = sgn( 7 ij) for j e Therefore, we have 


- 72 ) = >5r(7o)^(ri - 72 ) - Ar[p'(|7ol) ©sgn(7o)]^(7i - 72 ) 

= ‘^.^^r(7o)^7i^v-Ar Y. P'(l7ojl)l7ijl, (H) 

where is a sub vector of 71 formed by the components in . It follows from the 
concavity of p in Assumption |4] that p' is decreasing on [ 0 , 0 °). By condition @ and the 
continuity of 5„(-), there exists some 5 > 0 such that, for any 0 in a ball in centered at 

A 

0 with radius o, 

||S^.2-(0)||oo<Arp'(5). (12) 

We further shrink the radius of ball Ni to less than 5 so that | 7 oy| < | 7 i^| < d for 7 e 
and (fT^ holds for any 0 G Ni. Since 70 G Ni, it follows from (fT^ that (fTTI) is strictly less 
than 


^rp'(5)||7i^c||i-Arp'(5)||7i_^^,||i=0 

because of the monotonicity of p'. Thus 2r(72) > 2r(7i) holds and the proof completes. 

□ 


Lemma 2 Let Wt be a martingale difference sequence with for all t, where 

m> 2 and is a constant. Then we have 







The key to this lemma is a Marcinkiewicz-Zygmund inequality for martingales (e.g., Rio, 
2013, p. 61), but the Burkholder-Davis-Gundy inequality (e.g., Davidson, 1994, 15.18) is 
also used. The same assumption is found in Medeiros and Mendes (2012). 
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Proof of Lemma|2] A Marcinkiewicz-Zygmund inequality for martingales (e.g., Rio, 2013, 
p. 61) states that 

/ y, \ m r 

holds for m > 2. Thus, by the eondition E |wf |'” < for all t, we ean easily observe that 

/ T j, 

T~ml 2 ^ f j < {4m(m- l)}™/2r-i ^ElWfl™ < {4m(m - l)}™/2c^. 

This eompletes the proof. □ 

A.2 Proofs of theorems 

Remember that = (0i,..., 0^)^. Eor notational simplieity, we sometimes write, for 
example, Qt{{Q\, as 

Proof of Theorem [T] Eet = supp(0®). Consider the events 

4 = {llAorll- < {q-'"'T) and 4 = < Ap'( 0 +)log-‘r}, 

where q = 0{T^) and X = 0{T^^). It follows from Bonferroni’s inequality and Markov’s 
inequality together with Assumption [5ta)(b) that 

>1- ^ ^ p(|ri/250^| >ri/2-“p'(o+)) 

^ q\o%T ^'^r™2(i/2-«)p/(o+)'«2iog-™2r 

= 1 - 6 >(log^^ T) - (9(p5~™2(l/2-«) logmz (13) 

where the last two terms are o(l) beeause of the eondition 5 < — oc). Under the 

event S’j n we will show that there exists a solution 0 G to eonditions (f 8 ])- (fT 0 l) with 
sgn( 0 ) = sgn( 0 ®) and ||0 — 0 ®||<x, = 0 (r~^logr) for some 7 G ( 0 , 1 / 2 ]. 
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Step 1. We first prove that, for a suffieiently large T, equation dH) has a solution 0 
inside the hypereube 

N = € R«: ||e^„- = r-nogT] 

when we suppose ^ = ^q. Define the funetion 'P : —)■ by 

Then eondition ([8]) is equivalent to =0. To show that the solution is in the hyper¬ 
eube N, we expand around 0^^. Funetion (fT4l) is written as 

= «l„T(e^o - - Ap'(ie^„l) 0s8n(e4-o)l 

+ IH^oT(ei.o) - ei„) 

where 0^^ lies on the line segment joining 6 and 6^^. Sinee the matrix is invertible 
by Assumption [5te), (fT5l) is further written as 

= + + =: 9 + vj + wr (16) 

when T is suffieiently large. We then bound the last two terms, vt and wt, in (fT^ . 

First, we deal with vt- For any 0^^ G N, it holds that 

min |0jj > min |0®| — dj = dj > T~^\o%T (17) 

by Assumption [3];b), and sgn(0^Q) = sgn(0^^). Using the monotonieity of p'(-) in As¬ 
sumption [T] with (fTTI) and Assumption (S^e), we obtain 

||Ap'(|0^j)0sgn(0^j||.„<ApV)=o(^-i/2r-nogr). 
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This, along with the property of matrix norms and Assumptionj^e), entails that on the event 

||vr||oo = [5'^or ~^P (l^^ol) 

(l|5Vll-+llV(l^-^ol)0«gn(0^o)ll-) 

= Op{T-nogT), (18) 

where the last equality follows from q = 0{t4 and 6o < (1/2 — 7 )/(l /2 + 1/mi). 

Next, we eonsider wj- By the property of norms and Assumption|5];c)(d), we have 

ll^-HU = - 0\)iu 

< qO,{i)KT\\e'^^ - ei,JUI|e#o - 

Therefore, sinee Kt = Op{l) and q = 0 {t4 with 5o < j, 

\\wT\\oo = qOp{T~^nog^T) =Op{T-nogT) (19) 

holds, provided that 0,- — 0? = T'^logT for all i e ^q. By (fT^ . (flSl) . and (fT9l) . for a 
suffieiently large T and for all i G ^q, we thus have 

'B,(0^j>r-nogr-||vr||oc-||wr||oc>o if 0,-0« = r-nogr; (20) 

'B,-(0^j<-r-nogr + ||vr||^ + ||M)r||^<o if 0,-0« = -r-nogr. (2i) 

By the eontinuity of ^(■) and inequalities (l20l) and (|2TI) . an applieation of Miranda’s exis- 
tenee theorem shows that the equation 'T(0 _^q) =0 has a solution 0in N. Clearly, 0 
also solves the equation ^1^(0^^) = 0, in regard to the first equality in (fT^ . Thus, we have 
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shown that dU) indeed has a solution 0 in N. 


Step 2. Let 0 = (0^^, ^ with G M as a solution to dH) and O^c = 0. 

Next, we show that 0 satisfies dH) on the event Sj. By the triangle inequality and mean 
value theorem, we have 

<p(o+)iog->r+A-'||(a/ae^)s.^.He5„.o)(e#„-eii,)||». (22) 

where 0j^^^ lies on the line segment joining 0 ^q and 0^^. The first term of the upper bound 
in is negligible, so that it suffices to show the second term is less than p'(0+). Since 
0 ^0 sol^'es the equation = 0 in dT4l) . we get 

-^p'(l^^ol) 0sgn(0^o) = 0 , 

with 0^^ lying between 0^^ and 0^^. Note that H 0) is invertible with probability 
approaching one from Assumption [5tc)(d). Hence, from Assumption Oe), the last term of 
(l22l) is evaluated as 


<x-' sup i|(5/39^,)s^t(0i,o)[h,«-,h02,o)1-‘iu(iixVII"+II^p'(I®.<«'oI)II-) 

ei,e2eN ^ '' 


< A 1 


' cp'i0+) 
_ p'{d) 


A6>p(r^) 


/T) 1/2 logl/2 T + Xp'{d)) 


= /T)^/^\og^/^T + cp'{Q+). 


(23) 


Since the first term in the final equation of (l23l) is equal to Op(r“+i^+^/™'~i/2logi/2 T), 
which is Op{\) by Assumption (Sj^a), (l23l) is eventually less than p'(0+). This verifies con¬ 
dition dH). 

Finally, condition dlQl) is guaranteed by Assumption [5tc)(d) for a sufficiently large T. 
Therefore, by LemmafU we have shown that 0 = (0^^, 0^0^ is a strict local maximizer of 
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the penalized likelihood Qr(0) in O with ||0 — 0°||oo = O(r ^logT) and 9^c — 0 under 
the event S'y Thus, the proofs of (a) and (b) are eompleted by (fT3l) . □ 

Proof of Theorem 1^ First, we show results (a) and (b) through the following two steps. 

Step 1. We eonsider Qt{0) in the eorreetly eonstrained spaee {0 : O^c = 0 e 

and foeus on the ^-dimensional intrinsie subspaee {0^^ G The corresponding 
PQL is given by 

Qt{9^o,0) =Lt{9^q, 0) - Y,PxiWj\)- (24) 

f=i 

We now show the existence of a strict local maximizer 9^^ of Qt (, 0) such that 11 9^^ — 
9^J\ = Op{{q/Ty/^). To this end, it is sufficient to prove that, for any given e > 0, there 
is a large constant C such that 

pI sup er(ei-„+“('3/O‘G0)<fir(8i„.0 ))^‘-«' 

Vll»ll-C / 

This implies that, with probability tending to one, there is a local maximizer 9^^ of Qt{9^q, 0) 

ill the ball e R« : \\ 6 ^„ - 8^J| < Recall that p(t) = px(l)/k. We have 

RtM 1 = er(8i + ata/r)‘G0)-er(8i.0) 

-Art (p(|8" + <t;ta/r)‘/"|)-p(|e"|)) - W + W- 

f=l 

First, we evaluate {II) . The Taylor expansion gives 

p(|e»+a;(p/r)‘/2|)-p(|e«|) = p'(|ef|)(|e»+„x?/J')‘'V|8“l) 

<p'(dT)(q/T)'/^\uj\, ( 26 ) 

where 10®*| lies between 10°| and \9j +Uj{q/T)^^^\, and the last inequality follows from the 
monotonicity of p'(-) and the triangle inequality. By (|2^ . the Cauchy-Schwarz inequality, 
and AssumptionUbl), we have 

|(;;)|<Arp'(dr)(8/7')‘-'"t l")l 

;=i 
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< XTp'{dT){q/TY/^q^^^\\u\\ = 0{q/T)\\u 


Next, we eonsider (/). By the Taylor expansion, we have 

(/) = te/r)'/’-sVlT" + =: (h) + (h), 

where the veetor 0^ lies between 0^^ and 0^^ + By the Cauehy-Sehwarz in¬ 

equality and Assumption [^a), together with the Markov inequality, we obtain 


whereas, by Assumptions Oe) and^d), we get 


(/ 2 ) = 






u < 


q/T 


Cl(l+Op(l))||M| 


where C\ (> 0 a.s.) is the minimum eigenvalue of Beeause {I 2 ) dominates (//) and 

(/i) when a large value of ||m|| is taken, sup||j^||^(-i?r(w) tends to negativity as T grows large. 
Thus, (1251) holds. 


Step 2. To complete the proof of (a) and (b), it remains to show that 0® := (0^^,O) 
is indeed a strict local maximizer of Qt{0) in M^. From Lemma [H it suffices to check 
conditions dU), dH), and (fTOl) while setting 0 = 0®, but condition dS]) is clearly satisfied by the 
argument so far and the Karush-Kuhn-Tucker condition. Condition (fTOl) is also satisfied by 
Assumptions Oc) and[^d). To verify dH), we consider the event 

By the Markov inequality and Assumptions Ob), we have 

P{S^) > 1 - ^ > ri/2;^rp'(0+)/2) 
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From the assumed eondition d < m 2 (1/2 — a), P{Sj) tends to one as T goes to infinity. 
Thus, it suffiees to eheek whether eondition dH) holds on the event Sj by a similar argument 
as in (I 22 I) and (l2^ . From Assumption [^e), we obtain 


<p(o+)/2+A-'sup||(a/M^„)s^.Hei.o)|l2,-ll(e4-„-e^„)ll 

61 

= p(o+)/2 + Op(r«+^)(^/r)i/2, (28) 

where lies on the line segment joining 0^ and 0^^. The last term in (|2^ is Op(l) by 
Assumption llta), whieh eompletes the proof of (a) and (b). 

Finally, we prove (e). To this purpose, it suffiees to show the asymptotie normality of 
0^Q. By the first order eondition {d/dd^^)QT{B^(^,0) = 0 and Taylor expansion of the 
likelihood funetion, we have, for any veetor a G sueh that ||a|| = 1, 

= T I e #01) ® sgn(e.^.) 

+l« 0) - ( eji,, - e \) 

=: (///,) +(/// 2 ) + (/// 3 ), 

where 0^^ lies between 0and 0^^. The proof is eompleted if we prove {llli) and {llh) 
are Op(l) beeause (///i) is asymptotieally standard normal by the direet use of Assumption 
Hb) and the Slutzky lemma. By the Cauehy-Sehwarz inequality. Assumptions Ha), and 
Ill;b2), we obtain 

|(///2)| = r‘/g|a'^4;y"Arp'(|e^J)0sgn(e^.)| 

< < T'<"-q'I^C2o((qT)-'l^) = 0 ( 1 ). 
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Similarly, by the Cauchy-Schwarz inequality, Assumptions [^c), and Ha), we obtain 


l(;;;3)| = -hV](8.-^o - 9\)l 

< T'l^CoKrWe*^^ - e%j\\e,^„ - e^W = = o„{\), 

and the proof is eompleted. □ 


Proof of Proposition[T] The proof is eompleted if Assumptions[5ta)-(e),Hd)(e), andHa)(b) 
are verified. First, we show that Assumption [5];a)(b) are true for mi = m 2 = 4. Using the 
property of the vec-operator and Kronecker produet (e.g., Liitkepohl, 2005, A.12.1), and 
letting Yec{£ixJ ) =: (wif,..., Wpt)~^ = Wt and be the (/, 7 )th element of , we see that 


= ^)£t = P~^{Ikr^'^ 

k 

= P^ Y. • ■ ■ ’ o^^Wk+j,t, ..., 

,/=i 

• ■ ■ : (^^^W{kr-\)k+j,t, ■■■, ^^^^{kr-l)k+j,lV ■ 


Henee, it suffices to consider a typical element ^ where £ G {1 ,... , k} 

and h G {0,1,.. .,kr- 1}, and prove tiyatE\T^^/^Yj=iShk+£,t\‘^ < Note that Ef=i 
is a martingale because is a martingale difference sequence with respect to ^t-\- 

This proof is thus completed if we show E < 0 ° in view of Lemma H To see this, 

we notice that ..., are zero except a finite number of them, by AssumptionHc). Let 
{7 ^ {I 5 • • ■ 5 ^} • <7^'^ 7 ^ 0} and k := ll^fllo- Applying the C^-inequality repeatedly, we 


have 


E 5 


h I 
hk+£,t I 



k 

4 

E 

Y ^^^^hk+j,t 

= E 


,/=l 



< 2^^ ^ E 


0^-'whk+j,t 




< 2^*kmax |a^'^|‘^maxE Iw^+y? 






since E\whk^j^t\^ < 0 ° is easily observed from the law of iterated expectations under As¬ 
sumption [8];a)(b). Thus, the result follows. 
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Next, we eheek Assumptions Oe) and^d), butl^d) is elear beeause the Hessian does 
not depend on 0. We then have S'") elear from the 

eonstruetion of this Hessian together with Assumption[8ta) and Rule (6) in Liitkepohl (2005, 

p. 661). 

Next, we verify As sumption [^e) with /3 = 6o/2. To this end, it is suffieient to show, for 
every i G {l,...,p}, max||^||=i | = Op{T^/'^) for a veetor v G Using 

the Cauehy-Sehwarz inequality and property of the norm, the left-hand side is bounded by 
II .. .,Hfqx) II ^ <?^/^maxi<;<^ I^O'rl- Sinee Hf-j = Op{\) under Assumption [8ta)(b) 
and q = 0{T^), the result follows. 

Finally, we observe Assumption |7];a)(b) are satisfied. Sinee F = ElxtxJ], and Eg 
are finite and positive definite, T (gjE^^EgE^^ is finite and positive definite. Thus, 7*^^ = 
7(^^(r(8)E^^EeE^^)7(^^ is finite and positive definite. Sinee ^ stationary 

ergodie martingale differenee sequenee with 



for any a G sueh that ||a|| = 1, we have ~^d N{0, 1). □ 
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Table 1: Estimation accuracy 




Oracle 

MEE 

SCAD 

MCP 

Easso 

T\a 


- 

- 

2.5 

20 

1.5 

20 

- 

100 

RMSE 

0.374 

1.634 

0.984 

0.863 

1.043 

0.865 

0.871 

300 


0.214 

0.885 

0.559 

0.494 

0.596 

0.494 

0.524 

500 


0.164 

0.678 

0.416 

0.378 

0.450 

0.378 

0.411 

1000 


0.116 

0.478 

0.251 

0.262 

0.281 

0.262 

0.295 

100 

STDEV 

0.373 

1.632 

0.837 

0.701 

0.904 

0.705 

0.698 

300 


0.214 

0.884 

0.510 

0.426 

0.540 

0.426 

0.435 

500 


0.164 

0.678 

0.394 

0.331 

0.419 

0.331 

0.346 

1000 


0.116 

0.478 

0.264 

0.229 

0.276 

0.230 

0.250 

100 

MSSC [%] 

- 

- 

89.2 

84.9 

91.4 

85.7 

83.1 


(nonzero) 

- 

- 

(66.5) 

(82.7) 

(52.8) 

(81.6) 

(84.4) 


(zero) 

- 

- 

(92.3) 

(85.2) 

(96.5) 

(86.3) 

(82.9) 

300 


- 

- 

91.0 

85.2 

94.5 

86.3 

81.2 


(nonzero) 

- 

- 

(86.8) 

(95.0) 

(78.9) 

(94.7) 

(95.2) 


(zero) 

- 

- 

(91.5) 

(83.9) 

(96.5) 

(85.2) 

(79.3) 

500 


- 

- 

91.9 

85.8 

95.5 

86.9 

80.2 


(nonzero) 

- 

- 

(93.3) 

(97.5) 

(86.8) 

(97.3) 

(97.9) 


(zero) 

- 

- 

(91.8) 

(84.2) 

(96.7) 

(85.5) 

(77.9) 

1000 


- 

- 

94.6 

87.9 

97.3 

88.8 

79.7 


(nonzero) 

- 

- 

(98.1) 

(99.2) 

(95.8) 

(99.1) 

(99.5) 


(zero) 

- 

- 

(94.1) 

(86.4) 

(97.5) 

(87.4) 

(77.1) 


42 











Table 2: RMSEs of out-of-sample forecasts 



h\T 

3 

6 

12 

24 

36 

60 

84 

120 

DNS 

1 

0.519 

0.559 

0.760 

1.018 

1.156 

1.220 

1.145 

0.984 


3 

0.784 

0.873 

1.128 

1.438 

1.564 

1.548 

1.400 

1.164 


6 

1.102 

1.224 

1.511 

1.838 

1.923 

1.813 

1.603 

1.311 


12 

1.806 

1.952 

2.134 

2.287 

2.252 

2.023 

1.769 

1.454 

sVAR 

1 

0.185 

0.168 

0.219 

0.284 

0.308 

0.310 

0.295 

0.271 


3 

0.328 

0.343 

0.404 

0.496 

0.521 

0.501 

0.461 

0.414 


6 

0.614 

0.628 

0.669 

0.745 

0.752 

0.688 

0.610 

0.524 


12 

1.128 

1.153 

1.153 

1.126 

1.055 

0.895 

0.770 

0.647 


Table 3: Ratios of RMSEs of out-of-sample forecasts 


h\z 

3 

6 

12 

24 

36 

60 

84 

120 

sVAR/DNS 1 

0.356 

0.301 

0.288 

0.279 

0.266 

0.254 

0.258 

0.275 

3 

0.418 

0.393 

0.358 

0.345 

0.333 

0.324 

0.329 

0.356 

6 

0.557 

0.513 

0.443 

0.405 

0.391 

0.379 

0.381 

0.400 

12 

0.625 

0.591 

0.540 

0.492 

0.468 

0.442 

0.435 

0.445 
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